Abstract


We present a novel architecture, the “stacked what-where auto-encoders”
(SWWAE), which integrates discriminative and generative pathways and provides
a unified approach to supervised, semi-supervised and unsupervised learning
without relying on sampling during training. An instantiation of SWWAE uses a
convolutional net (Convnet) (LeCun et al. (1998)) to encode the input, and employs a
deconvolutional net (Deconvnet) (Zeiler et al. (2010)) to produce the reconstruction.
The objective function includes reconstruction terms that induce the hidden
states in the Deconvnet to be similar to those of the Convnet. Each pooling layer
produces two sets of variables: the “what”, which are fed to the next layer, and
the complementary “where”, which are fed to the corresponding layer in the
generative decoder.


1 Introduction 


A desirable property of learning models is the ability to be trained in supervised, unsupervised, or
semi-supervised mode with a single architecture and a single learning procedure. Another desirable
property is the ability to exploit the advantages of both discriminative and generative models. A
popular approach is to pre-train auto-encoders in a layer-wise fashion, and subsequently fine-tune
the entire stack of encoders (the feed-forward pathway) in a supervised discriminative manner
(Erhan et al. (2010); Gregor & LeCun (2010); Henaff et al. (2011); Kavukcuoglu et al. (2009; 2008;
2010); Ranzato et al. (2007); Ranzato & LeCun (2007)). This approach fails to provide a unified
mechanism for unsupervised and supervised learning. Another approach, which provides a unified
framework for all three training modalities, is the deep Boltzmann machine (DBM) model (Hinton et
al. (2006); Larochelle & Bengio (2008)). Each layer in a DBM is a restricted Boltzmann machine
(RBM), which can be seen as a kind of auto-encoder. Deep RBMs have all the desirable properties;
however, they exhibit poor convergence and mixing properties, ultimately due to the reliance on
sampling during training. The main issue with stacked auto-encoders is asymmetry. The mapping
implemented by the feed-forward pathway is often many-to-one, for example mapping images to
invariant features or to class labels. Conversely, the mapping implemented by the feed-back
(generative) pathway is one-to-many, e.g. mapping class labels to image reconstructions. The common
way to deal with this is to view the reconstruction mapping as probabilistic. This is the approach
of RBMs and DBMs: the missing information that is required to generate an image from a category
label is dreamed up by sampling. This sampling approach can lead to interesting visualizations, but
is impractical for training large-scale networks because it tends to produce highly noisy gradients.


If the mapping from input to output of the feed-forward pathway were one-to-one, the mappings 
in both directions would be well-defined functions and there would be no need for sampling while 
reconstructing. But if the internal representations are to possess good invariance properties, it is 
desirable that the mapping from one layer to the next be many-to-one. For example, in a Convnet,
invariance is achieved through layers of max-pooling and subsampling. 

Our model attempts to satisfy two objectives: (i) to learn a factorized representation that encodes
invariance and equivariance, and (ii) to leverage both labeled and unlabeled data to learn this
representation in a unified framework. The main idea of the approach we propose here is very
simple: whenever a layer implements a many-to-one mapping, we compute a set of complementary
variables that enable reconstruction. A schematic of our model is depicted in figure 1(b). In
the max-pooling layers of Convnets, we view the position of the max-pooling “switches” as the
complementary information necessary for reconstruction. The model we propose consists of a
feed-forward Convnet, coupled with a feed-back Deconvnet. Each stage in this architecture is what
we call a “what-where auto-encoder”. The encoder is a convolutional layer with ReLU followed by
a max-pooling layer. The output of the max-pooling is the “what” variable, which is fed to the next
layer. The complementary variables are the max-pooling “switch” positions, which can be seen as
the “where” variables. The “what” variables inform the next layer about the content with incomplete
information about position, while the “where” variables inform the corresponding feed-back decoder
about where interesting (dominant) features are located. The feed-back (generative) decoder
reconstructs the input by “unpooling” the “what” using the “where”, and running the result through a
reconstructing convolutional layer. Such “what-where” convolutional auto-encoders can be stacked
and trained jointly without requiring alternating optimization (Zeiler et al. (2010)). The
reconstruction penalty at each layer constrains the hidden states of the feed-back pathway to be
close to the hidden states of the feed-forward pathway. The system can be trained in a purely
supervised manner: the bottom input of the feed-forward pathway is given the input, the top layer of
the feed-back pathway is given the desired output, and the weights of the decoders are updated to
minimize the sum of the reconstruction costs. If only the top-level cost is used, the model reverts
to purely supervised backprop. If the hidden layer reconstruction costs are used, the model can be
seen as supervised with a reconstruction regularization. In unsupervised mode, the top-layer label
output is left unconstrained, and simply copied from the output of the feed-forward pathway. The
model becomes a stacked convolutional auto-encoder. As with Boltzmann machines (BM), the underlying
learning algorithm doesn’t change between the supervised and unsupervised modes and we can switch
between different learning modalities by clamping or unclamping certain variables. Our model is
particularly suitable when one is faced with a large amount of unlabeled data and a relatively small
amount of labeled data. The fact that no sampling (or contrastive divergence method) is required
gives the model good scaling properties; it is essentially just backprop in a particular architecture.


2 Related work 


The idea of “what” and “where” has been defined previously in different ways. One related method,
known as “transforming auto-encoders” (Hinton et al. (2011)), introduced “capsule” units. In that
work, two sets of variables are trained to encapsulate “invariance” and “equivariance” respectively,
by providing the parameters of particular transformation states to the network. Our work is carried
out in a more unsupervised fashion in that it doesn’t require the true latent state while still
being able to encode similar representations within the “what” and “where”. Switch information is
also used by some visualization work such as Zeiler et al. (2010), although such work only has a
generative pass and merely uses a feed-forward pass as an initialization step.


Similar definitions have been applied to learn invariant features (Gregor & LeCun (2010); Henaff
et al. (2011); Kavukcuoglu et al. (2009; 2008; 2010); Ranzato et al. (2007); Ranzato & LeCun
(2007); Makhzani & Frey (2014); Masci et al. (2011)). Among them, most works address only
unsupervised feature learning and therefore do not unify different learning modalities. Another
relevant hierarchical architecture is proposed in Ranzato et al. (2007) and Ranzato & LeCun (2007);
however, because this architecture is trained in a layer-wise greedy manner, its performance is not
competitive with jointly trained models.


In terms of joint loss minimization and semi-supervised learning, our work can be linked to Weston
et al. (2012) and Ranzato & Szummer (2008), with the main advantage being the ease of extending a
Convnet with a Deconvnet, thereby enabling the utilization of unlabeled data. Paine et al. (2014)
analyzed the regularization effect with similar architectures in a layer-wise fashion.


One recent line of work (Rasmus et al. (2015b), Rasmus et al. (2015a)) adopts deep auto-encoders
to support supervised learning, but employs a completely different strategy to harness the lateral
connections between same-stage encoder-decoder pairs. In that work, decoders receive the entire
pre-pooled activation state from the encoder, whereas decoders in SWWAE only receive the “where”
state from the corresponding encoder stages. Further, due to the lack of an unpooling mechanism in
the Ladder networks, it is restricted to reconstructing only the top layer within the generative
pathway (Γ model), which loses the “ladder” structure. By contrast, SWWAE is not subject to this
restriction.
3 Model Architecture

We consider the loss function of SWWAE depicted in figure 1(b), composed of three parts:

$$L = L_{NLL} + \lambda_{L2rec} L_{L2rec} + \lambda_{L2M} L_{L2M}, \quad (1)$$

where $L_{NLL}$ is the discriminative loss, $L_{L2rec}$ is the reconstruction loss at the input
level and $L_{L2M}$ charges intermediate reconstruction terms. The $\lambda$'s weight the losses
against each other.

Pooling layers in the encoder split information into “what” and “where” components, depicted in
figure 1(a): “what” is essentially the max and “where” carries the argmax, i.e., the switches of
maximal activation defined under a local coordinate frame over each pooling region. The “what”
component is fed upward through the encoder, while the “where” is fed through lateral connections
to the same stage in the feed-back decoding pathway. The decoder uses convolution and “unpooling”
operations to approximately invert the output of the encoder and reproduce the input, shown in
figure 1(b). The unpooling layers use the “where” variables to unpool the feature maps by placing
the “what” into the positions indicated by the preserved switches. We use negative log-likelihood
(NLL) loss for classification and L2 loss for reconstructions, i.e.,

$$L_{L2rec} = \|x - \tilde{x}\|_2, \qquad L_{L2M} = \|x_m - \tilde{x}_m\|_2, \quad (2)$$

where $L_{L2rec}$ denotes the reconstruction loss at the input level and $L_{L2M}$ denotes the
middle reconstruction loss. In our notation, $x$ (no subscript) represents the input and $x_m$
(with subscript) represents the feature map activations of the Convnet. Similarly, $\tilde{x}$ and
$\tilde{x}_m$ are the input and activations of the Deconvnet, respectively. The entire model
architecture is shown in figure 1(b). Note that in the following, we may use $L_{L2*}$ to represent
the weighted sum of $L_{L2rec}$ and $L_{L2M}$.
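
The split into “what” and “where” and the corresponding unpooling can be illustrated with a
minimal sketch in plain Python. It assumes non-overlapping square pooling regions on a single
feature map whose sides are divisible by the pooling size; the function names
`max_pool_with_switches` and `unpool` are hypothetical and chosen only for this illustration.

```python
def max_pool_with_switches(fmap, k):
    """Non-overlapping k x k max-pooling over a 2D list (one feature map).

    Returns (what, where): `what` holds the max of each region and `where`
    holds the (row, col) offset of that max inside its region (the switch).
    Assumes the height and width of `fmap` are divisible by k.
    """
    what, where = [], []
    for i in range(0, len(fmap), k):
        what_row, where_row = [], []
        for j in range(0, len(fmap[0]), k):
            best, best_pos = fmap[i][j], (0, 0)
            for di in range(k):
                for dj in range(k):
                    if fmap[i + di][j + dj] > best:
                        best, best_pos = fmap[i + di][j + dj], (di, dj)
            what_row.append(best)
            where_row.append(best_pos)
        what.append(what_row)
        where.append(where_row)
    return what, where


def unpool(what, where, k):
    """Place each "what" value at its "where" position; zeros elsewhere."""
    out = [[0.0] * (len(what[0]) * k) for _ in range(len(what) * k)]
    for i, row in enumerate(what):
        for j, value in enumerate(row):
            di, dj = where[i][j]
            out[i * k + di][j * k + dj] = value
    return out


fmap = [
    [0.1, 0.9, 0.0, 0.2],
    [0.3, 0.2, 0.7, 0.1],
    [0.0, 0.0, 0.4, 0.3],
    [0.5, 0.1, 0.2, 0.6],
]
what, where = max_pool_with_switches(fmap, 2)
restored = unpool(what, where, 2)
```

In this sketch `what` is the only signal passed upward, while `where` travels laterally to the
decoder; `restored` keeps the dominant activations at their original positions and is zero
everywhere else.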
Figure 1: Left (a): pooling-unpooling. Right (b): model architecture. For brevity, fully-connected
layers are omitted in this figure. 


3.1 Soft version “what” and “where” 


Recently, Goroshin et al. (2015) introduced a soft version of the max and argmax operators within
each pooling region:

$$m_k = \sum_{N_k} z(x, y) \frac{e^{\beta z(x, y)}}{\sum_{N_k} e^{\beta z(x, y)}} \approx \max_{N_k} z(x, y),$$
$$p_k = \sum_{N_k} \begin{bmatrix} x \\ y \end{bmatrix} \frac{e^{\beta z(x, y)}}{\sum_{N_k} e^{\beta z(x, y)}} \approx \arg\max_{N_k} z(x, y), \quad (3)$$

where $z(x, y)$ denotes the activation on the feature maps and $x$, $y$ represent the spatial
location, which takes normalized values from -1 to 1. $N_k$ stands for the pooling region. Note
that $\beta$ is a hyper-parameter that is always set to be non-negative. It parametrizes soft
pooling in such a way that the larger $\beta$ is, the closer soft-pooling approaches max-pooling,
while a small $\beta$ approximates mean-pooling. We use interpolation in the unpooling stage to
handle the continuous values conveyed by “where”.

The soft pooling and unpooling can be embedded seamlessly into the SWWAE model, and it has
the virtue that one can backpropagate through $p$, in contrast to hard max-pooling, which is
not differentiable w.r.t. the argmax “switch” locations. Furthermore, soft-pooling operators enable
location information to be more accurately represented and thus enable the features to capture fine
details about the input, as evidenced in our visualization experiments (see section 4.2).
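
The behavior of the soft operators at the two extremes of $\beta$ can be checked numerically. The
sketch below assumes the soft “what” and “where” are softmax-weighted averages of the activations
and of the normalized coordinates over one square pooling region; the function name `soft_pool`
and the toy region are hypothetical.

```python
import math


def soft_pool(region, beta):
    """Soft "what" and "where" over one square pooling region (2D list).

    Each position is weighted by exp(beta * z) / sum(exp(beta * z)).
    Coordinates are normalized to [-1, 1]; x indexes columns, y indexes rows.
    Requires a region of side length at least 2.
    """
    n = len(region)
    coords = [-1.0 + 2.0 * i / (n - 1) for i in range(n)]
    z_max = max(max(row) for row in region)  # subtracted for numerical stability
    weights = [[math.exp(beta * (z - z_max)) for z in row] for row in region]
    total = sum(sum(row) for row in weights)
    what = sum(
        w * z for w_row, z_row in zip(weights, region) for w, z in zip(w_row, z_row)
    ) / total
    where_x = sum(w * coords[j] for w_row in weights for j, w in enumerate(w_row)) / total
    where_y = sum(w * coords[i] for i, w_row in enumerate(weights) for w in w_row) / total
    return what, (where_x, where_y)


region = [
    [0.0, 1.0],
    [0.2, 0.1],
]
near_max = soft_pool(region, 100.0)  # large beta: close to max / argmax
near_mean = soft_pool(region, 0.0)   # beta = 0: plain mean, centered "where"
```

With a large $\beta$ the soft “what” approaches the maximum activation and the soft “where”
approaches the coordinates of that maximum; with $\beta = 0$ the “what” is the region mean and the
“where” sits at the region center.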


3.2 Training with joint losses and regularization


As we mentioned, the SWWAE provides a unified framework for learning with all three learning 
modalities, all within a single architecture and single learning algorithm, i.e. stochastic gradient 
descent and backprop. Switching between these modalities can be achieved as follows: 

• for supervised learning, we can mask out the entire Deconvnet pathway by setting
$\lambda_{L2*}$ to 0, and the SWWAE falls back to a vanilla Convnet.


• for unsupervised learning, we can nullify the fully-connected layers on top of the Convnet
together with the softmax classifier by setting $\lambda_{NLL} = 0$, i.e. dropping the
discriminative term $L_{NLL}$. In this setting, the SWWAE is equivalent to a deep
convolutional auto-encoder.


• for semi-supervised learning, all three terms of the loss are active. The gradient
contributions from the Deconvnet can be interpreted as an information-preserving regularizer.
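
The three modalities above differ only in which loss weights are non-zero. The sketch below makes
this explicit under a few assumptions: activations are flattened to plain lists, the intermediate
term sums one L2 distance per stage, and the unit weights in `MODES` are placeholders rather than
tuned values. The names `swwae_loss`, `MODES`, `lam_nll`, `lam_rec` and `lam_mid` are hypothetical.

```python
import math


def l2_distance(a, b):
    """Euclidean distance between two flattened activation vectors."""
    return math.sqrt(sum((p - q) ** 2 for p, q in zip(a, b)))


def swwae_loss(nll, x, x_rec, hidden, hidden_rec,
               lam_nll=1.0, lam_rec=1.0, lam_mid=1.0):
    """Weighted sum of classification, input-level and intermediate losses.

    `hidden` and `hidden_rec` are lists of per-stage activations from the
    encoder and decoder; summing over stages is an assumption of this sketch.
    """
    rec = l2_distance(x, x_rec)
    mid = sum(l2_distance(h, h_rec) for h, h_rec in zip(hidden, hidden_rec))
    return lam_nll * nll + lam_rec * rec + lam_mid * mid


# Placeholder weights: only the zero / non-zero pattern matters here.
MODES = {
    "supervised": {"lam_nll": 1.0, "lam_rec": 0.0, "lam_mid": 0.0},
    "unsupervised": {"lam_nll": 0.0, "lam_rec": 1.0, "lam_mid": 1.0},
    "semi_supervised": {"lam_nll": 1.0, "lam_rec": 1.0, "lam_mid": 1.0},
}

example = {
    "nll": 2.0,
    "x": [3.0, 4.0], "x_rec": [0.0, 0.0],
    "hidden": [[1.0, 0.0]], "hidden_rec": [[0.0, 0.0]],
}
losses = {mode: swwae_loss(**example, **weights) for mode, weights in MODES.items()}
```

Because the same function is used in every mode, switching modalities requires no change to the
learning algorithm, only to the weights.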


The idea of using reconstruction as a regularizer was studied previously in Erhan et al. (2010),
although with unsupervised pre-training as its setup. SWWAE is connected to unsupervised
pre-training in the sense that both paradigms attempt to provide better generalization
by forcing the model to reconstruct. One argument for unsupervised learning acting as a
regularizer is that the supervised loss drives the model toward $P(Y \mid X)$, while unsupervised
pre-training captures the input distribution $P(X)$, and learning $P(X)$ is helpful to learning
$P(Y \mid X)$ (Erhan et al. (2010)). However, we argue that applying this statement to the
unsupervised pre-training setup is unconvincing. One can argue that using $P(X)$ merely to
initialize the model for learning $P(Y \mid X)$ has a very weak effect; i.e. the gradients from
learning $P(Y \mid X)$ completely overwrite the initial weights, thus eliminating any regularizing
effect that may have been obtained from learning $P(X)$. We argue that joint training, as in
SWWAE, is a more effective strategy: our approach tries to model $P(Y \mid X)$ together with
$P(X)$ jointly during training. Comparisons between different regularizers are shown in the
appendix.

Moreover, training jointly with multiple losses helps avoid collapsing or learning trivial
representations. For one thing, a common issue with auto-encoders is that they learn little more
than the identity function, i.e. copying the input to get a perfect reconstruction. For another,
sparse auto-encoders (Makhzani & Frey (2014), Makhzani & Frey (2013)) admit a well-known trivial
solution: adding an L1 penalty on the hidden layers is likely to scale down the encoder weights
and scale up the decoder weights in order to reconstruct while achieving small activations. We
argue that a direct way to avoid such trivial solutions is to include a supervised loss, which
directly optimizes a non-trivial, useful criterion that helps factorize the data into semantically
relevant factors of variation.
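
The trivial solution for L1-penalized auto-encoders can be reproduced with a toy example. The
sketch assumes a one-layer auto-encoder with a ReLU encoder and a linear decoder, both without
biases; the identity weights, the input and the scale factor `c` are arbitrary illustrative
choices, and all names are hypothetical.

```python
def relu(vector):
    return [max(0.0, value) for value in vector]


def matvec(matrix, vector):
    return [sum(a * b for a, b in zip(row, vector)) for row in matrix]


def scale(matrix, factor):
    return [[factor * a for a in row] for row in matrix]


def autoencode(w_enc, w_dec, x):
    """ReLU encoder followed by a linear decoder (no biases)."""
    hidden = relu(matvec(w_enc, x))
    return hidden, matvec(w_dec, hidden)


w_enc = [[1.0, 0.0], [0.0, 1.0]]
w_dec = [[1.0, 0.0], [0.0, 1.0]]
x = [0.5, 2.0]

hidden, reconstruction = autoencode(w_enc, w_dec, x)

# Shrink the encoder by c and grow the decoder by 1/c: because ReLU is
# positively homogeneous, the reconstruction is unchanged while the hidden
# activations (and hence the L1 penalty) shrink by the factor c.
c = 0.125
small_hidden, same_reconstruction = autoencode(scale(w_enc, c), scale(w_dec, 1.0 / c), x)

l1_before = sum(abs(h) for h in hidden)
l1_after = sum(abs(h) for h in small_hidden)
```

Here `l1_after` equals `c * l1_before` while the two reconstructions coincide, so the sparsity
penalty can be driven toward zero without learning anything about the data.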


3.3 Intermediate L2 constraints 

The reasons for adding intermediate L2 reconstruction terms are as follows. First, they prevent
the feature planes from being shuffled, so that the “where” maps conveyed from the encoder are
guaranteed to match the “what” from the decoder. Otherwise, the unpooling may see “what” and
“where” in shuffled order, and hence cannot work properly. Second, in particular when training
with classification loss, the intermediate terms rule out the scenario in which upper layers become
idle while only lower layers are busy reconstructing, in which case the filters of those idle
layers are not regularized. The related classification performance comparison for intermediate L2
terms is shown in the appendix. Third, in correspondence with layer-wise auto-encoder training,
each intermediate encoder/pool/unpool/decoder unit in SWWAE, combined with the intermediate L2
terms, can be seen as a single-layer convolutional auto-encoder (Masci et al. (2011)).


Workshop track - ICLR 2016


4 Experiments 


We use the following notation to describe our architecture (assuming square kernels), e.g.
(16)5c-(32)3c-2p-10fc, in which ‘(16)5c’ denotes a convolution layer with 16 feature maps
and kernel size 5, ‘2p’ denotes a 2x2 pooling layer and ‘10fc’ denotes a fully-connected
layer that connects to 10 hidden units. ReLU is omitted in the notation.
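
The notation is regular enough to be parsed mechanically. The following sketch turns such a string
into a list of layer descriptions; the function name `parse_architecture` and the dictionary keys
are hypothetical, and tokens outside the three forms described above are rejected.

```python
import re


def parse_architecture(spec):
    """Parse a string such as '(16)5c-(32)3c-2p-10fc' into layer dicts."""
    layers = []
    for token in spec.replace(" ", "").split("-"):
        conv = re.fullmatch(r"\((\d+)\)(\d+)c", token)  # (maps)kernel c
        pool = re.fullmatch(r"(\d+)p", token)           # size p
        fc = re.fullmatch(r"(\d+)fc", token)            # units fc
        if conv:
            layers.append({"type": "conv",
                           "maps": int(conv.group(1)),
                           "kernel": int(conv.group(2))})
        elif pool:
            layers.append({"type": "pool", "size": int(pool.group(1))})
        elif fc:
            layers.append({"type": "fc", "units": int(fc.group(1))})
        else:
            raise ValueError("unrecognized layer token: " + repr(token))
    return layers


layers = parse_architecture("(16)5c-(32)3c-2p-10fc")
```

For the example string this yields two convolution layers (16 maps with kernel 5, 32 maps with
kernel 3), one 2x2 pooling layer and one fully-connected layer with 10 units.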

4.1 Necessity of “where”

We address the necessity of “where” by showing the difference between reconstructions using “where”
and those not using “where”. Upsampling is an alternative way to do unpooling without dreaming
up “where”: since “what” is agnostic about “where”, it is simply copied to all
the positions. Figure 2 displays a group of reconstructed digits sampled from the MNIST testing
set, generated by a SWWAE trained on the MNIST training set. The architecture we use is
(16)5c-(32)3c-Xp, and the pooling size X varies from 2 to 16. Note that we use
hard max-pooling for this experiment and the architecture is trained in unsupervised mode.
Figure 2: Generation quality comparison between using upsampling (left) and unpooling (right).
From top to bottom, the pooling sizes are respectively 2, 4, 8, 16. 

On one hand, as the generations given by unpooling are visibly clearer and cleaner than the
ones given by upsampling, this experiment demonstrates that “where” is critical information for
reconstruction; one can barely obtain well-reconstructed images without preserving “where”. On
the other hand, this experiment can also be considered an example of using SWWAE for generative
purposes.
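
The difference between the two strategies can be seen even without a trained decoder. The sketch
below is a one-dimensional toy, not the experiment itself: it pools a sparse signal, then inverts
the pooling once with the switches and once by copying “what” to every position, and compares
the squared errors. All function names are hypothetical, and the signal length is assumed to be
divisible by the pooling size.

```python
def pool_1d(signal, k):
    """Max-pool a 1D list with region size k; return values and switches."""
    what, where = [], []
    for start in range(0, len(signal), k):
        region = signal[start:start + k]
        best = max(range(len(region)), key=region.__getitem__)
        what.append(region[best])
        where.append(best)
    return what, where


def unpool_1d(what, where, k):
    """Put each "what" back at its switch position; zeros elsewhere."""
    out = [0.0] * (len(what) * k)
    for index, (value, offset) in enumerate(zip(what, where)):
        out[index * k + offset] = value
    return out


def upsample_1d(what, k):
    """Copy each "what" to all k positions of its region (no "where")."""
    return [value for value in what for _ in range(k)]


def squared_error(a, b):
    return sum((p - q) ** 2 for p, q in zip(a, b))


signal = [0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.8]
what, where = pool_1d(signal, 4)
error_unpool = squared_error(signal, unpool_1d(what, where, 4))
error_upsample = squared_error(signal, upsample_1d(what, 4))
```

For this sparse signal the switch-based inverse is exact, whereas upsampling smears each peak over
its whole region; the gap widens as the pooling size grows, in line with the trend reported for
figure 2.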


4.2 Invariance and Equivariance 


In this section, we examine the relationship between “what” and “where” by using the
visualization approach proposed with transforming auto-encoders (Hinton et al. (2011)), in which a
number of “capsules” are trained to learn a representation consisting of equivariant and invariant
components. Analogously, the “what” and “where” in our model’s representation correspond to the
invariant and equivariant components, respectively. The experimental recipe is as follows: (1)
train a SWWAE using horizontally and vertically translated MNIST digits from the training set; (2)
feed untranslated digits from the testing set into the SWWAE and obtain the “what”
($\mathbb{R}$) and “where” ($\mathbb{R}^2$); (3) horizontally or vertically translate the same set
of digits, feed it into the SWWAE and cache the “what” and “where” correspondingly; (4) plot the
relationship between the “what” and “where” obtained from translated digits versus untranslated
ones, shown in figure 3. The architecture we use is (32)5c-(32)3c-2p-(32)3c-16p and we use soft
pooling/unpooling with $\beta = 100$; (5) since this experiment demands a large pooling size, we
plot the generations in figure 4 to make sure that SWWAE works appropriately under such large
pooling settings.


We draw the conclusion from figure 3 that “what” and “where” behave much like the invariance
and equivariance of capsules in Hinton et al. (2011). On one hand, “where” learns a highly
localized representation. Each element in the $\mathbb{R}^2$ “where” has an approximately linear
response to pixel-level translation in either the horizontal or the vertical direction and learns
to be invariant to the other. On the other hand, “what” learns to be locally stable and exhibits
strong invariance to input-level translation.


Figure 3: Scatter plots depicting feature response produced by translating the input. The horizontal
axis represents the “what” or “where” output from one feature plane for an untranslated digit image;
the vertical axis represents the “what” or “where” output from the same feature plane if that image
is translated by +3 or -3 pixels in either the horizontal or the vertical direction. From left to
right, the figures are respectively: first (a): “what” of horizontally translated digits versus
original digits; second (b): “where” of horizontally translated digits versus original digits;
third (c): “what” of vertically translated digits versus original digits; fourth (d): “where” of
vertically translated digits versus original digits. Note that circles mark the +3 translation and
triangles the -3 translation. In the “where” related plots, x and y denote the two dimensions of
“where” respectively.


Figure 4: Reconstructed MNIST digits in the capsule emulation experiments. The top row shows the
original input; the second row shows the reconstruction of those original inputs; the bottom two
rows display the reconstruction of horizontally translated digits in the positive and negative
direction respectively.


4.3 Classification performance




Figure 5: Validation error vs. $\lambda_{L2*}$ on a range of datasets for SWWAE semi-supervised
experiments. Left (a): MNIST. Right (b): SVHN. Different curves denote different numbers of labels
being used.
Table 1: Comparison between SWWAE and other best published results on SVHN with 1000 labels.


model                                          error rate (in %)

KNN                                            77.93
TSVM (Vapnik & Vapnik (1998))                  66.55
M1+KNN (Kingma et al. (2014))                  65.63
M1+TSVM (Kingma et al. (2014))                 54.33
M1+M2 (Kingma et al. (2014))                   36.02

SWWAE without dropout ($\lambda_{L2*}$ = 0.8)  27.83
SWWAE with dropout ($\lambda_{L2*}$ = 0.4)     23.56


4.3.1 MNIST & SVHN

As a start, we assess the effect of SWWAE on classification by performing both semi-supervised and
supervised experiments on MNIST and SVHN. We attempt to demonstrate that introducing a paired
Deconvnet with a group of reconstruction losses can help generalization and provide an effective
way to make use of unlabeled data. Note that in the classification experiments we use the hard
version of pooling because it performs better than its soft counterpart in terms of classification.

We start by constructing semi-supervised versions of both datasets. The MNIST dataset consists
of images of 10 different classes (0 to 9) of size 32x32 with 60,000 training samples and 10,000
test samples. We follow previous work for data preparation: we randomly select labeled samples
from the training set, while the rest of the samples are used without labels. The sizes of the
labeled subsets are respectively 100, 600, 1000 and 3000, and we ensure each class has the same
number of digits in the labeled set. The SVHN dataset consists of 73,257 digits for training,
26,032 digits for testing and 531,131 extra training samples that are less difficult. Likewise, we
construct a labeled dataset for SVHN that contains 1000 samples uniformly distributed over the 10
classes, chosen randomly from the non-extra training set. In order to attain reliable results, we
run each experiment for several rounds, with datasets refreshed before each round, and we average
the performance of all rounds as the final evaluation.
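
The class-balanced selection of labeled samples can be sketched as follows. This is an illustrative
helper, not the original data pipeline: the function name `split_labeled_unlabeled`, its
parameters and the fixed seed are assumptions, and labels are assumed to be integers from 0 to
`n_classes - 1`.

```python
import random


def split_labeled_unlabeled(labels, n_labeled, n_classes=10, seed=0):
    """Pick n_labeled indices with equal counts per class; the rest are unlabeled."""
    if n_labeled % n_classes != 0:
        raise ValueError("n_labeled must be divisible by n_classes")
    rng = random.Random(seed)
    per_class = n_labeled // n_classes

    # Group sample indices by their class label.
    by_class = {c: [] for c in range(n_classes)}
    for index, label in enumerate(labels):
        by_class[label].append(index)

    # Draw the same number of samples from every class, without replacement.
    labeled = []
    for c in range(n_classes):
        labeled.extend(rng.sample(by_class[c], per_class))

    labeled_set = set(labeled)
    unlabeled = [i for i in range(len(labels)) if i not in labeled_set]
    return sorted(labeled), unlabeled


toy_labels = [i % 10 for i in range(1000)]  # stand-in for real training labels
labeled, unlabeled = split_labeled_unlabeled(toy_labels, 100, seed=1)
```

Calling the helper with a different `seed` before each round is one way to refresh the labeled
subset between repetitions.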

We examine the “standalone” regularization effect of SWWAE on both datasets by plotting the
validation error vs. $\lambda_{L2*}$ ($\lambda_{L2M}$ and $\lambda_{L2rec}$ are set to be equal
for this experiment) in figure 5. By “standalone”, we mean that no other well-known regularizer is
applied.

We further evaluate SWWAE on the testing set of SVHN with the hyper-parameters chosen by
validation error. Table 1 shows the results. We additionally evaluate SWWAE on SVHN in a purely
supervised manner (with all the available labels) and find that the testing error decreases from
5.89% for a vanilla Convnet to 4.94% for SWWAE under the same configuration. The architectures we
use for MNIST and SVHN are respectively (64)5c-2p-(64)3c-2p-(64)3c-2p-10fc and
(128)5c-2p-(128)3c-(256)3c-2p-(256)3c-2p-10fc. More exploration on MNIST is
shown in the appendix.


4.3.2 STL-10 


STL-10 contains larger 96x96 pixel images and relatively little labeled data (5,000 training
samples, 100,000 unlabeled samples and 8,000 test samples). The training set is mapped to 10
pre-defined folds with 1,000 images each. Therefore, STL-10 has a 100:1 ratio of unlabeled samples
to labeled ones in each fold. We follow the testing protocol of STL-10: we first tune the
hyper-parameters for each fold by validation error and let the best-performing model predict the
testing set. The final score is reported by averaging the testing scores of the 10 folds. For
STL-10, we assess the possibility of combining batch normalization (Ioffe & Szegedy (2015)) and
SWWAE. Furthermore, we carry out spatial batch normalization, which preserves the mean and
standard deviation of each feature map while the feature maps get normalized independently
based on their own statistics. We devise a VGG-style (Simonyan & Zisserman (2014)) deep net,
(64)3c-4p-(64)3c-3p-(128)3c-(128)3c-2p-(256)3c-(256)3c-(256)3c-(512)3c-(512)3c-(512)3c-2p-10fc,
in which each convolution layer is followed by a spatial batch normalization layer, applied in both
the Convnet and Deconvnet pathways. Results are shown in table 2.
Table 2: Comparison between SWWAE and other best published results on STL-10.


model                                                    accuracy

Multi-task Bayesian Optimization (Swersky et al. (2013)) 70.1%
Zero-bias Convnets + ADCU (Paine et al. (2014))          70.2%
Exemplar Convnets (Dosovitskiy et al. (2014))            75.4%

SWWAE                                                    74.33%
Convnet of same configuration                            57.45%


Table 3: Accuracy of SWWAE on CIFAR-10 and CIFAR-100 in comparison with best published
single-model results. Our results are obtained with the common experimental setting in which we
only adopt contrast normalization, small translation and horizontal mirroring for data
preprocessing.


model                                                        CIFAR-10   CIFAR-100

All-Convnet (Springenberg et al. (2014))                     92.75%     66.29%
Highway Network (Srivastava et al. (2015))                   92.40%     67.76%
Deeply-supervised nets (Lee et al. (2014))                   92.03%     65.43%

SWWAE ($\lambda_{L2rec}$ = 1, $\lambda_{L2M}$ = 0.2)         92.23%     69.12%
Convnet of same configuration                                91.33%     67.50%


4.4 Large scale experiments 


4.4.1 CIFAR with 80 million tiny images


The CIFAR-10 and CIFAR-100 datasets are sampled and labeled from the 80 million tiny images
dataset (Torralba et al. (2008)). Both datasets contain 60,000 32x32 images, which are a small
portion of the set of 80 million images. In contrast to the former classification experiments, this
experiment involves substantially more unlabeled data in relation to the amount of labeled
data. We train the SWWAE with a VGG-style network (Simonyan & Zisserman (2014)):
(128)3c-(256)3c-2p-(256)3c-(512)3c-2p-(512)3c-(512)3c-2p-(512)3c-(512)3c-2p-128fc-10fc,
in which each convolution is followed by spatial batch normalization
(Ioffe & Szegedy (2015)) in both the Convnet and the Deconvnet. To compare with results from other
approaches, we perform the experiments in the common experimental setting that only adopts contrast
normalization, small translation and horizontal mirroring for data preprocessing. The results are
shown in table 3.


5 Conclusion and outlook 

The overall system, which can be seen as pairing a Convnet with a Deconvnet, yields good accuracy 
on a variety of semi-supervised and supervised tasks. We envision that such an architecture may
also be useful in video-related tasks where unlabeled samples abound.
Appendix: more MNIST


We present more experimental results on MNIST in two respects. First, on the validation
set, we compare the performance of SWWAE against other regularization methods, shown in table
4. Note that in order to make the comparison more realistic and closer to practical use, we add
dropout (Hinton et al. (2012)) at the fully-connected layers as the default for this set of
comparisons. Regularizers under comparison include dropout on the convolution layers and an L1
sparsity penalty on the hidden layers. Besides, we also train SWWAE in an unsupervised manner and
separately train a softmax classifier afterwards using labeled samples; this disjointly trained
architecture is denoted by “unsup-sfx”. We similarly try using SWWAE as an unsupervised
pre-training approach, followed by fine-tuning the entire Convnet part driven by labeled data,
which is denoted by “unsup-pretr”. Note that the difference between “unsup-pretr” and “unsup-sfx”
lies in whether the Convnet part is frozen when training the softmax classifier on top. In
addition, “noL2M” denotes experiments in which SWWAE is trained with only the reconstruction loss
at the input level, i.e. $\lambda_{L2M} = 0$ and $\lambda_{L2rec}$ is chosen by validation error.
Second, we report the testing set error rate obtained by SWWAE with the chosen hyper-parameters
and compare it with the best published results in table 5. Note that for the experiments on the
MNIST testing set, the labeled set is generated by sampling from the entire MNIST training set; the
experiments on the validation set, instead, sample the labeled data only from a subset of the MNIST
training set, because the rest is held out as the validation set. The SWWAE configuration is
(64)5c-2p-(64)3c-2p-(64)3c-2p-10fc.


Table 4: Comparison against other regularization approaches and disjoint training approaches on 
MNIST dataset. The scores are validation error rate (in %). Dropout is added at the fully-connected 
layers as default. 


model / N               100            600           1000           3000

SWWAE                   10.66 ± 0.55   4.35 ± 0.30   3.17 ± 0.17    2.13 ± 0.10
dropout on convolution  14.23 ± 0.94   4.70 ± 0.38   3.37 ± 0.11    2.08 ± 0.10
L1                      10.91 ± 0.29   4.61 ± 0.28   3.55 ± 0.31    2.67 ± 0.25
unsup-sfx               17.81 ± 0.06   8.41 ± 0.08   6.40 ± 0.06    4.76 ± 0.03
unsup-pretr             -              9.80 ± 0.06   6.135 ± 0.03   4.41 ± 3.11
noL2M                   12.41 ± 1.95   4.63 ± 0.24   3.15 ± 0.22    2.08 ± 0.18


Table 5: Comparison of testing error rate (in %) between SWWAE and other best published results 
on MNIST dataset within semi-supervised setting. 




model / N                                100           600           1000          3000

Convnet (LeCun et al. (1998))            22.98         7.86          6.45          3.35
TSVM (Vapnik & Vapnik (1998))            16.81         6.16          5.38          3.45
CAE (Rifai et al. (2011b))               13.47         6.3           4.77          3.22
MTC (Rifai et al. (2011a))               12.03         5.13          3.64          2.57
PL-DAE (Lee (2013))                      10.49         5.03          3.46          2.69
WTA-AE (Makhzani & Frey (2014))          -             2.37          1.92          -
M1+M2 (Kingma et al. (2014))             3.33 ± 0.14   2.59 ± 0.05   2.40 ± 0.02   2.18 ± 0.04
Ladder Network (Rasmus et al. (2015a))   1.06 ± 0.37   -             0.84 ± 0.08   -

SWWAE without dropout                    9.17 ± 0.11   4.16 ± 0.11   3.39 ± 0.01   2.50 ± 0.01
SWWAE with dropout                       8.71 ± 0.34   3.31 ± 0.40   2.83 ± 0.10   2.10 ± 0.22


Aside from the semi-supervised setting, we also explore SWWAE training on the fully labeled
training dataset, in which we find that SWWAE achieves a better testing error rate of 0.71% versus
0.76% obtained by a Convnet under the same configuration.

We reason that SWWAE does not work as well as Ladder networks (Rasmus et al. (2015a)) because
reconstructing MNIST digits is overly easy for SWWAE. Assume we have a one-layer
SWWAE with one pooling and one unpooling layer in the respective pathways. Since
MNIST is a roughly binary dataset (0/1), within the unpooling stage decoding doesn’t necessarily
demand the information from “what” for reconstruction; i.e., it could get a perfect reconstruction
by pinning 1 on the positions indicated by “where”. Therefore, we believe that reconstructing
the MNIST dataset provides insufficient regularization on the encoding pathway. However, this
phenomenon won’t happen on natural image datasets, such as CIFAR or STL-10, where we show
good results with SWWAE.